In this course the students are getting an introduction in the interesting world of data sciences. In this course the tools which are mostly used are RStudio, R markdown and GitHub.
Here is the link to my personal github repository
Lets read the data from given site. Then we take a look for the data structure.
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166 7
We have seven variables and 166 observations. As variables we have gender and age of the respondent, attitude toward statistics, deep learning, stratetig learning, surfage learning and exam points. Link to meta text is in here.
Lets have a brief graphical overview to data.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
From graph above we can see how variables are distributed and how variables are related to each others. Colours used in graph are signaling the gender (Blue means male, red female). We can see that data set has more women than men (110 females, 56 males). Mostly distributions do not differ significantly between the genders, only in attitude distribution profile is significanty different, also in stratetig learning and surfage learning means look to differ.
Here is the summary of the data.
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Lets define exam points as dependent variable. Variables with highest correlation coefficients related to exam points are attitude, stratetig learning and surface learning. They are the independent variables in our model
\[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \beta_{3}X_{3i} + \varepsilon_{i}\] where \(Y_{i}\) is exam points, \(X_{1i}\) is attitude, \(X_{2i}\) is stratetig learning and \(X_{3i}\) is surface learning.
Here are the plotted values of each variable:
qplot(attitude, points, data = lrn14, aes(col = gender)) + geom_smooth(method = "lm") + ggtitle("Regressor Attitude")
qplot(stra, points, data = lrn14, aes(col = gender)) + geom_smooth(method = "lm") + ggtitle("Regressor Stratetig Learning")
qplot(surf, points, data = lrn14, aes(col = gender)) + geom_smooth(method = "lm") + ggtitle("Regressor Surface Learning")
As the correlation coefficients show, variable exam points has positive relation with attitude and strategic learning and negative relation with surface learning.
Lets analyse our linear model \(Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \beta_{3}X_{3i} + \varepsilon_{i}\).
At first we must check the model assumptions:
model <- lm(points ~ attitude + stra + surf, data = lrn14)
plot(model, which = c(1,2,5))
Residuals looks to have mean zero, points are mostly normally distributed and there are no points with significally great leverage.
Here is the summary of the model:
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Regressor attitude is only statistically significant independent variable in 10 % confidence level in the model. This model has adjusted \(R^2\) value 0.19 which means that the model explains approximately 20 % of the exam poins deviation. With F-test we can check how likely both stategic and surface learning’s coefficients are equal zero.
library(lmtest)
library(sandwich)
library(car)
linearHypothesis(model, c("surf = 0", "stra = 0"))
## Linear hypothesis test
##
## Hypothesis:
## surf = 0
## stra = 0
##
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 164 4641.1
## 2 162 4544.4 2 96.743 1.7244 0.1815
With p-value 0.18 we can not reject the null hypothesis “surf = 0”, “stra = 0”, so we can make model without those two variables.
model2 <- lm(points ~ attitude, data = lrn14)
Checking the assumptions:
plot(model2, which = c(1,2,5))
Taking of the variables strategic learning and surface learning do not make big difference to original model.
Summary of the model \[Y_{i} = \beta_{0} + \beta_{1}X_{1i} + \varepsilon_{i}\]
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In this model adjusted \(R^2\) is only slightly smaller than in original model. Also estimated OLS-coefficient for attitude, 0.57, has not changed significantly. This parameter is statistically significant in under 0.1% confidence level. Estimated value for attitude means that when attitude points increase by 1, then exam points expected increase is 3.5.
Read data from own data folder.
alc <- read.csv("C:/Users/juhol/OneDrive/Documents/GitHub/IODS-project/data/alc.csv", header = T, sep = ",")
alc <- alc[,-1] # First column only duplicates the row numbers
dim(alc)
## [1] 382 35
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
summary(alc)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
Data icludes 382 observations and 35 variables. Data is combined from two datasets which are loaded from here. Site includes also the meta text for data which is usefull to read. R-code for data wrangling can be found in here and data is available in here.
Packages used in this section:
library(dplyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(cowplot)
library(boot)
Lets introduce the column names of our data.
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In our analyse section we are interested in relationships between alcohol consumption and social parameters. Lets choose variable “high use of alcohol” (binary) as the dependent variable of interest and “extra-curricular activities” (binary), “romantic relationship” (binary), “quality of family relationships” (numeric: from 1 - very bad to 5 - excellent) and “going out with friends” (numeric: from 1 - very low to 5 - very high) as social parameters of interest. Lets also control the age and sex.
# Lets make subset of data with variables "sex", "age", "activities", "romantic", "famrel", "goout" and "high_use".
variables_sub <- c("sex", "age", "activities", "romantic", "famrel", "goout", "high_use")
alc_sub <- select(alc, one_of(variables_sub))
str(alc_sub)
## 'data.frame': 382 obs. of 7 variables:
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
Lets plot alcohol use which each of these variables.
alc_sex <- ggplot(data = alc_sub, aes(x = sex, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and sex")
alc_sex
alc_age <- ggplot(data = alc_sub, aes(x = age, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and age")
alc_age
alc_act <- ggplot(data = alc_sub, aes(x = activities, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and activities")
alc_act
alc_rom <- ggplot(data = alc_sub, aes(x = romantic, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and romantic relationship")
alc_rom
alc_family <- ggplot(data = alc_sub, aes(x = famrel, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and quality of family relationship")
alc_family
alc_goout <- ggplot(data = alc_sub, aes(x = goout, fill = high_use)) + geom_bar() + ggtitle("Consumption of alcohol and going out with friends")
alc_goout
It looks clear that high consuming of alcohol is more common for males than for females. Interesting fact in this sample is that all age groups from 15 to 18 (where we have better sample) are relatively having same part of high users. It also looks clear that in the group of students having extra-curricular activities, the heavy using is not as common as in other group. Only by looking the histogram, it looks that better quality of family relationships decreases the risk of high consumption of alcohol. Not suprisingly the group of students who are spending more time in out with friends is also consuming more alcohol.
Lets looking this in more detail. Lets start from GLM-model \[\text{HighUse}_{i} = \beta_{0} + \beta_{1}\text{Sex}_{i} + \beta_{2}\text{Age}_{i} + \beta_{3}\text{Activities}_{i} + \beta_{4}\text{Romantic}_{i} + \beta_{5}\text{Family}_{i} + \beta_{6}\text{GoOut}_{i} + \epsilon_{i}\] and then using the backward method for finding the best candidate for the best linear model.
# First model
glm1 <- glm(high_use ~ sex + age + activities + romantic + famrel + goout, data = alc_sub, family = "binomial")
step(glm1, direction = "backward")
## Start: AIC=402.06
## high_use ~ sex + age + activities + romantic + famrel + goout
##
## Df Deviance AIC
## - romantic 1 389.00 401.00
## - age 1 389.99 401.99
## <none> 388.06 402.06
## - activities 1 390.86 402.86
## - famrel 1 397.30 409.30
## - sex 1 403.62 415.62
## - goout 1 436.48 448.48
##
## Step: AIC=401
## high_use ~ sex + age + activities + famrel + goout
##
## Df Deviance AIC
## - age 1 390.62 400.62
## <none> 389.00 401.00
## - activities 1 391.99 401.99
## - famrel 1 397.84 407.84
## - sex 1 405.01 415.01
## - goout 1 437.32 447.32
##
## Step: AIC=400.62
## high_use ~ sex + activities + famrel + goout
##
## Df Deviance AIC
## <none> 390.62 400.62
## - activities 1 393.80 401.80
## - famrel 1 399.15 407.15
## - sex 1 406.47 414.47
## - goout 1 442.61 450.61
##
## Call: glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial",
## data = alc_sub)
##
## Coefficients:
## (Intercept) sexM activitiesyes famrel goout
## -2.2313 0.9971 -0.4494 -0.4050 0.8118
##
## Degrees of Freedom: 381 Total (i.e. Null); 377 Residual
## Null Deviance: 465.7
## Residual Deviance: 390.6 AIC: 400.6
By Akaike information criterion the best model is the following: \[\text{HighUse}_{i} = \beta_{0} + \beta_{1}\text{Sex}_{i} + \beta_{2}\text{Activities}_{i} + \beta_{3}\text{Family}_{i} + \beta_{4}\text{GoOut}_{i} + \epsilon_{i}\]
This model has \(\text{AIC} = 400.6\).
glm_best <- glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial",
data = alc_sub)
summary(glm_best)
##
## Call:
## glm(formula = high_use ~ sex + activities + famrel + goout, family = "binomial",
## data = alc_sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5303 -0.7884 -0.5366 0.8617 2.6530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2313 0.6476 -3.445 0.00057 ***
## sexM 0.9971 0.2559 3.896 9.77e-05 ***
## activitiesyes -0.4494 0.2532 -1.775 0.07588 .
## famrel -0.4050 0.1396 -2.902 0.00371 **
## goout 0.8118 0.1229 6.603 4.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 390.62 on 377 degrees of freedom
## AIC: 400.62
##
## Number of Fisher Scoring iterations: 4
We can calculate \(\text{McFadden's Pseudo} R^2\) representing the overall effect size of the model (more about from here):
ll.null <- glm_best$null.deviance/-2
ll.proposed <- glm_best$deviance/-2
pseudo_r_sqr <- (ll.null-ll.proposed)/ll.null
pseudo_r_sqr
## [1] 0.1611876
ll.null2 <- glm1$null.deviance/-2
ll.proposed2 <- glm1$deviance/-2
pseudo_r_sqr_org <- (ll.null2-ll.proposed2)/ll.null2
pseudo_r_sqr_org
## [1] 0.1666838
We get that \(\text{McFadden's Pseudo} R^2 = 0.16\) which is relatively small. The original model has \(\text{McFadden's Pseudo} R^2 = 0.17\) so having extra variables does not increase the effect size significantly.
In this model all other variables (including the intercept) but “extra-curricular activities” (binary) are statistically significant (counted by Wald-test) in 99 % confidence level. Also variable “extra-curricular activities” is significant in 90 % confidence level. All the coefficients are in “log(odds)” -format. With binary variables, coefficient is represinting the “log(odds ratio)” -format, for example variable sex has \(log(\text{odd ratio}) = 0.9971 \Leftrightarrow \text{odd ratio} = 2.71041\). The social variables which were the variables of our interest, having extra-curricular activities has an negative effect to high consumption of alcohol as the good family relationships also does, both with log(odd) -estimation around -0.40 - -0.50. Going out with friends has a positive effect to heavy drinking, \(log(\text{odd}) = 0.8118 \Leftrightarrow \text{odd} = 2.251958\), meaning that approximately the ratio between heavy drinkers of those who go out and those how don’t is \(\frac{5}{2}\).
We can provide 2x2 cross tabulation of predictions versus the actual values. Lets also have a visual look up, how our model “fitted” with data.
## Predicted values of our model
predicted_data <- data.frame(
probability_of_hc = glm_best$fitted.values,
high_consumption = alc_sub$high_use)
predicted_data <- mutate(predicted_data, prediction = probability_of_hc > 0.5)
predicted_data <- predicted_data[ order(predicted_data$probability_of_hc, decreasing=FALSE),]
predicted_data$rank <- 1:nrow(predicted_data)
# head(predicted_data)
# tabulate the target variable versus the predictions
table(high_use = predicted_data$high_consumption, prediction = predicted_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 254 14
## TRUE 75 39
# Plotting
ggplot(data=predicted_data, aes(x=rank, y=probability_of_hc)) +
geom_point(aes(color=high_consumption), alpha=1, shape=4, stroke=2) +
xlab("Index") +
ylab("Predicted probability of high consumption of alcohol")
We see that our model with social parameters controlled by sex made average quality predictions, data is mainly distributed in right way. It can be assumed from the small overall effect size of the model (only 0.17 as shown above) that it would not perfectly generate the right predictions. Also what we have to take into account is that the model doesn’t tell us about the causalitys. For example the causality between high consumption of alcohol and going out with friends has probably a causality in both ways.
Lets do this also with cross-validation for the model.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Average number of wrong predictions in the (training) data
loss_func(predicted_data$high_consumption, predicted_data$probability_of_hc)
## [1] 0.2329843
# 10-fold cross-validation
cv <- cv.glm(data = alc_sub, cost = loss_func, glmfit = glm_best, K = 10)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2408377
# For comparing, lets try probabilities P(High_use) = 1 and 0
# P(High_use) = 1
loss_func(predicted_data$high_consumption, prob = 1)
## [1] 0.7015707
# P(High_use) = 0
loss_func(predicted_data$high_consumption, prob = 0)
## [1] 0.2984293
Average number of wrong predictions in the training data is 0.23 and 10-fold cross-validation gives 0.24. Our model gives better predictions than just assuming the probability of high consumption to be zero (loss function gives 0.3) or one (loss function gives 0.7).
Here are the packages used in this session:
# REMOVE ALL OBJECTS
rm(list=ls())
# access the MASS package
library(MASS)
library(corrplot)
library(tidyverse)
library(ggplot2)
library(GGally)
library(plotly)
Let’s load the Boston data from the MASS package. Data is all about the “Housing Values in Suburbs of Boston”. More about the data can be read from here.
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
ggpairs(Boston, lower = "blank", upper = list(continuous = "points", combo =
"facethist", discrete = "facetbar", na = "na"))
Boston dataset has 506 observations and 14 variables. All variables are in numeric format. Variable chas is a dummy variable (Charles River dummy variable, 1 if tract bounds river and 0 otherwise) and variable rad is an index (index of accessibility to radial highways).
Only by waching the plot above, it looks that the data has some linear relations. These relations can bee seen with correlation matrix.
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
# print the correlation matrix
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
##corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
corrplot(cor_matrix, method="circle", tl.pos = "d")
Some high correlations between variables can be seen, for example between index of accessibility to radial highways and full-value property-tax rate per $10,000 (0.91), proportion of owner-occupied units built prior to 1940 and weighted mean of distances to five Boston employment centres (-0.75) and nitrogen oxides concentration and proportion of non-retail business acres per town (-0.77).
For further analysis, we will standardize the data. Let’s have a same kind of lookup in the data after we have standadized the data.
# center and standardize variables
boston_std <- scale(Boston)
# summaries of the scaled variables
summary(boston_std)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_std)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_std)
From summary we can see that we have succesfully standardized the data. All the variables have mean zero and deviation is standardized. Let’s analyse the crime rate variable deeper.
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
As wee see from crime table above, all quantiles have same number of samples as it should be. Let’s divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Next we will fit the linear discriminant analysis on the train set. We use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2574257 0.2623762 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.88571224 -0.9631730 -0.150563081 -0.8745198 0.4246783
## med_low -0.09413106 -0.2665609 -0.007331936 -0.5443643 -0.1504400
## med_high -0.38045185 0.1737650 0.247665303 0.3454622 0.1183547
## high -0.48724019 1.0149946 -0.069385757 1.0483486 -0.4624321
## age dis rad tax ptratio
## low -0.8415368 0.8564521 -0.7036343 -0.7469372 -0.49088999
## med_low -0.2989224 0.3461849 -0.5511961 -0.4679786 -0.07406251
## med_high 0.3533641 -0.3584289 -0.3957196 -0.3021079 -0.26401188
## high 0.8374969 -0.8558762 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3757553 -0.73954054 0.50544573
## med_low 0.3133960 -0.11330541 -0.03440868
## med_high 0.1300681 -0.01921482 0.19901930
## high -0.7071548 0.85232768 -0.65303188
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.16145026 0.698237380 -0.88732882
## indus 0.04387196 -0.376136375 0.46027937
## chas -0.08020820 -0.062763080 0.08393346
## nox 0.31357800 -0.798142406 -1.45281309
## rm -0.11094000 -0.108663576 -0.17207489
## age 0.31610729 -0.129583154 -0.04722198
## dis -0.06841277 -0.216749362 0.17448293
## rad 3.23282481 0.895792406 -0.06893070
## tax -0.08602375 0.059117851 0.53600662
## ptratio 0.14480039 0.009524140 -0.29849017
## black -0.17145748 -0.005977023 0.10933614
## lstat 0.13850907 -0.249121627 0.21711326
## medv 0.13739856 -0.373693189 -0.31066216
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9514 0.0362 0.0124
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col="orange", length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col= "orange", pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
From the plot and summary above we can see that variable index of accessibility to radial highways has the greatest seperation weight related to other variables, both in LD1 and LD2 dimensions. In LDI coefficient for index of accessibility to radial highways is 3.4 when second greatest weight, 0.37, is variables nitrogen oxides concentration. In dimension LD2 variable index of accessibility to radial highways does not have such a dominance as in LD1. In both dimensions it is positively correlated with crime rate. In LD2 coefficient for it is 0.9, when for example variables proportion of residential land zoned for lots over 25,000 sq.ft. and nitrogen oxides concentration (parts per 10 million) have both coefficient around 0.7 as in absolute value, but these LD2 components have oppposite directions (nitrogen oxides concentration (parts per 10 million) increases criminal rate). In our LD1-LD2 map moving south-east inreases the criminal rate.
For next we will test how well our model predicts the crime rate. Earlier in our analysis we separated our data in learning and testing parts and we removed categorial variable crime from test data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 9 2 0
## med_low 4 15 3 0
## med_high 0 6 14 0
## high 0 0 1 29
In table above we see that our model predicts high crime rate very well. Also in other rate classes our model makes quite good predictions, even in classes med_high and med_low.
Let’s use original standardized data with original crime rate variable. Then we will alalyse data with K-means clustering method.
boston_std <- as.data.frame(boston_std)
boston_dist <- dist(boston_std)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_std, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
Based on graph above we decide to use two clusters.
# K-means clustering
km <-kmeans(boston_std, centers = 2)
# plot the Boston dataset with clusters
#pairs(Boston, col = km$cluster)
class(km$cluster)
## [1] "integer"
ggpairs(boston_std, mapping = aes(col = as.factor(km$cluster), alpha = 0.3), lower = "blank", upper = list(continuous = "points", combo =
"facethist", discrete = "facetbar", na = "na"))
From the plot above we see that k-means clustering works quite well, in most cases those two clusters are enough to seperate the data to two resonable distributions. For example, in the case of nitrogen oxides concentration (nox) we see quite clearly two different distributions.
For a bonus analyse we will search for a reasonable k-mean clustering for original, standardized Manhattan data. Then we will use these clusters as target variables in new LDA-model and find the variables which have most weight in clustering.
boston_scaled_j <- scale(Boston)
boston_scaled_j <- as.data.frame(boston_scaled_j)
# calculate the total within sum of squares
twcss_j <- sapply(1:k_max, function(k){kmeans(boston_scaled_j, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss_j, geom = 'line')
## Lets use 7 clusters
km_j <-kmeans(boston_scaled_j, centers = 7)
lda.fit_j <- lda(km_j$cluster ~ ., data = boston_scaled_j)
lda.fit_j
## Call:
## lda(km_j$cluster ~ ., data = boston_scaled_j)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6
## 0.19960474 0.17984190 0.06719368 0.27272727 0.08498024 0.07509881
## 7
## 0.12055336
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3276033 -0.4819336 0.4539433 -0.2723291 0.4033614 -0.4554552
## 2 0.6891277 -0.4872402 1.0922021 -0.2723291 0.9675559 -0.5127412
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681
## 4 -0.3997875 -0.1254255 -0.6189159 -0.2723291 -0.7089634 -0.1617920
## 5 -0.3732407 -0.1422288 -0.8310017 -0.2723291 -0.2005297 1.6901170
## 6 1.9373059 -0.4872402 1.0149946 -0.2723291 0.9805230 -0.2986618
## 7 -0.4142556 2.3574118 -1.1833581 -0.2077864 -1.1903591 0.7260513
## age dis rad tax ptratio black
## 1 0.7139676 -0.5156239 -0.599806921 -0.2986652 0.20347210 0.0769363
## 2 0.7518833 -0.7972862 1.533397731 1.5440835 0.80324049 0.1646510
## 3 0.3721322 -0.4033382 0.001081444 -0.0975633 -0.39245849 0.1715427
## 4 -0.7541779 0.6240671 -0.573249961 -0.6985399 -0.09292879 0.3605290
## 5 0.1841367 -0.3232389 -0.511800977 -0.8155263 -1.10522083 0.3496304
## 6 0.7720736 -0.8769466 1.659602895 1.5294129 0.80577843 -3.0248555
## 7 -1.4158153 1.6302712 -0.671220280 -0.5521444 -0.82906375 0.3536252
## lstat medv
## 1 0.4437738 -0.43288856
## 2 0.8098978 -0.65415404
## 3 -0.1643525 0.57334091
## 4 -0.4176324 0.01266366
## 5 -0.9616241 1.71992796
## 6 1.1866642 -1.08972128
## 7 -0.9679345 0.81083758
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim -0.055858880 -0.28280174 0.39942789 0.62394305 0.07126916
## zn -0.256103484 0.17091712 1.57768449 -0.60309927 -0.13217489
## indus 0.079839746 -0.47024036 -0.16909868 -0.26420738 -0.52074287
## chas 5.832830141 0.13558415 0.33654481 0.06117107 -0.24021698
## nox -0.008708387 0.52783295 -0.21825248 -0.10482010 0.13145221
## rm -0.082216127 -0.06752441 0.09541076 0.18047678 0.61851276
## age 0.084401663 -0.19943772 -0.48493022 0.12288716 0.40058616
## dis 0.057464812 0.42853710 0.33454359 -0.10425677 -0.25117932
## rad 0.178314081 -1.67575671 0.08941298 -0.89338698 0.56080972
## tax 0.075888055 -0.61822344 0.84585974 -0.86928072 0.03445999
## ptratio 0.017381740 -0.15088883 -0.06303904 0.08590537 -0.28534418
## black 0.058211177 0.84962506 -0.76578638 -1.82606194 0.15371287
## lstat -0.123096250 -0.19683160 0.04588153 -0.03043413 0.30872764
## medv -0.195627826 0.33721840 0.09174908 0.15224414 0.89521721
## LD6
## crim 0.08236240
## zn -1.25731674
## indus -0.27992287
## chas 0.06410065
## nox -0.51458192
## rm -0.10578655
## age -0.82552088
## dis 0.31487537
## rad 1.64313640
## tax -0.61832490
## ptratio -0.53422706
## black 0.10592104
## lstat -0.20241633
## medv -0.22594636
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6
## 0.6103 0.2326 0.0749 0.0498 0.0175 0.0148
# the function for lda biplot arrows
lda.arrows_j <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes_j <- as.numeric(km_j$cluster)
# plot the lda results
plot(lda.fit_j, dimen = 2, col = classes_j, pch = classes_j)
lda.arrows(lda.fit_j, myscale = 1)
Data is well clustered in seven part. Once again variable rad, index of accessibility to radial highways has a major influence in LD1 dimension. In LD2 based dimension variable tax, full-value property-tax rate per $10,000 has the greatest weight.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Let's find k-clusters for train set
n_k <- nrow(Boston)
ind_k <- sample(n_k, size = n_k*0.8)
train_k <- Boston[ind_k,]
train_k <- scale(train_k)
twcss2 <- sapply(1:k_max, function(k){kmeans(stats::na.omit(train_k), k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss2, geom = 'line') # 2
# k-means clustering
km_k <-kmeans(train_k, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_k$cluster)
# Let's increase the number of clusters to 4
km_l <-kmeans(train_k, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_l$cluster)
In this part we can compare the 3D scatterplots where base vectors are from LDA model made above. All the points in these three 3D scatterplots are in the same cordinates, only the colours differ.The first 3D scatterplot is colored with four crime rates, the second with 2-mean clustering method and the third with 4-mean clustering method. Colours distributions in k-mean clustering plots do not respond the colour distibution of crime rate distribution. It might be that criminal rate does not have as high weight as some other variables in the sense of clustering because it has relatively slight correlation with other variables (this can be seen above).
Here are the packages used in this session:
# REMOVE ALL OBJECTS
rm(list=ls())
library(GGally)
library(ggplot2)
library(dplyr)
library(corrplot)
library(tidyr)
library(FactoMineR)
Let’s import data and have a small lookup in it.
human <- read.csv("C:/Users/juhol/OneDrive/Documents/GitHub/IODS-project/data/human.csv", sep =",", header = T, row.names = 1)
# look at the (column) names of human
names(human)
## [1] "life_exp" "exp_years_education"
## [3] "gni" "maternal_mortality_ratio"
## [5] "adolescent_birth_rate" "female_parliament"
## [7] "sec_edu_ratio" "labour_rate_ratio"
# look at the structure of human
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ life_exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ exp_years_education : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ maternal_mortality_ratio: int 4 6 6 5 6 7 9 28 11 8 ...
## $ adolescent_birth_rate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ female_parliament : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ sec_edu_ratio : num 1.007 0.997 0.983 0.989 0.969 ...
## $ labour_rate_ratio : num 0.891 0.819 0.825 0.884 0.829 ...
# print out summaries of the variables
summary(human)
## life_exp exp_years_education gni
## Min. :49.00 Min. : 5.40 Min. : 581
## 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 4198
## Median :74.20 Median :13.50 Median : 12040
## Mean :71.65 Mean :13.18 Mean : 17628
## 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 24512
## Max. :83.50 Max. :20.20 Max. :123124
## maternal_mortality_ratio adolescent_birth_rate female_parliament
## Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 49.0 Median : 33.60 Median :19.30
## Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :1100.0 Max. :204.80 Max. :57.50
## sec_edu_ratio labour_rate_ratio
## Min. :0.1717 Min. :0.1857
## 1st Qu.:0.7264 1st Qu.:0.5984
## Median :0.9375 Median :0.7535
## Mean :0.8529 Mean :0.7074
## 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :1.4967 Max. :1.0380
In our first part of analysis we are taking a brief look in OECD data from HUMAN DEVELOPMENT REPORT 2015. In this section we are using subset from original data including variables Life expectancy at birth, Expected years of schooling, GNI per capita, Maternal mortality ratio, Adolescent birth rate, Female’s share of parliamentary seats, Female and male ratio of at least secondary education and Female and male ratio labour force participation rates. In dataset we have 155 observations which are representing countries.
All variables are including only numerical values. From summary we can see that all variables are variating relatively highly. Let’s have visual perspective in data.
# visualize the 'human' variables
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=loess, fill="red", color="red", ...) +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
ggpairs(human, lower = list(continuous = my_fn))
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
Distributions of variables Life expectancy at birth, GNI per capita, Adolescent birth rate and Female and male ratio labour force participation rates are highly skewed. Remaining part of variables look approximately normally distributed. From plot above we can see two drawn lines, the OLS estimated line and smooth curve. OLS line is representing possible linear relation between variables and smooth curve is estimated by counting the average line. Also correlations are representing linear relations. Clear linear or log-linear relation can be seen between many variables, for example between Life expectancy at birth and Expected years of schooling (corr 0.78), Maternal mortality ratio and Life expectancy at birth (corr -0.86, of corse maternal mortality effect directly to life expectancy) and between Adolescent birth rate and Life expectancy at birth (corr -0.73).
Next we are using principal components analysis (PCA) for human data. At first, we are analysing not standardized human data and after that we are comparing the results with standardized data.PCA is popular approach to reduce the dimensions in data. The main idea in PCA technique is to find the principal components, meaning the directions where observations have most variation. The first principal component is the direction where data is variating the most, the second principal component is the direction where data is variating the second most and etc.
pca_human_not_std <- prcomp(human)
sum_pca_human_not_std <- summary(pca_human_not_std)
# rounded percetanges of variance captured by each PC
pca_pr_not_std <- round(100*sum_pca_human_not_std$importance[2, ], digits = 3)
# print out the percentages of variance
pca_pr_not_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
# create object pc_lab to be used as axis labels
pc_lab_not_std <- paste0(names(pca_pr_not_std), " (", pca_pr_not_std, "%)")
# draw a biplot
biplot(pca_human_not_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_not_std[1], ylab = pc_lab_not_std[2])
The first principal component is capturing nearly all the variance (99.99%). The GNI variable is nearly only component spanning the PC1. This comes from the fact that GNI is variating between 581 -123124 where other variables variation is dozens of times smaller. That is why we have to standardize our data before using PCA.
human_std <- scale(human)
pca_human <- prcomp(human_std)
sum_pca_human <- summary(pca_human)
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*sum_pca_human$importance[2, ], digits = 3)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.605 16.237 9.571 7.583 5.477 3.595 2.634 1.298
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Now principal components are capturing reasonable amount of variation. The PC1 is capturing 53.6 % of variation, PC2 16.2% and cumulatively the first two PCs are capturing 69.8 % of variation. Interpretation of the graph above is that x-axis is representing the dimension spanned by PC1 and y-axis is correspondly representing the dimension spanned by PC2. Variance of our variables can be represented as linear combination of the principal components. For example, life expectancy’s coefficient in direction PC1 is -0.44 when maternal_mortality_ratio has opposite direction in dimension PC1 with coefficient 0.44. In two first principal components spanned space life expectancy and labour rate ratio are nearly orthogonal meaning that variance of the first is orthogonal related to to others variance (correlation between these two variables is -0.14). In space spanned with fist eight principal components cordinate for life expectancy is \([-0.44372240, 0.02530473, -0.10991305, 0.05834819, -0.1628935, 0.42242796, -0.43406432, -0.62737008]\).
In our bixplot horizontal dimension (PC1) can bee seen as representation of the wealthy of the country and vertical direction (PC2) as equility variance. In the South-West corner of the bixplot we can find Northern Countries and from the really opposite corner countries such as Yemen and Afghanistan.
For making the analyse it is necessary to scale the data. PCA is based only on the variance of the data and normally we are not interested in the differences of variance relating the scale.
Let’s import the Tea data from the package Factominer.
tea <- read.table("http://factominer.free.fr/book/tea.csv",header=TRUE,sep=";")
colnames(tea)
## [1] "breakfast" "afternoon.tea" "evening"
## [4] "after.lunch" "after.dinner" "anytime"
## [7] "home" "work" "tearoom"
## [10] "friends" "restaurant" "pub"
## [13] "variety" "how" "sugar"
## [16] "format" "place.of.purchase" "type"
## [19] "sex" "profession" "sport"
## [22] "age" "age_Q" "frequency"
## [25] "exotic" "spirituality" "good.for.health"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "refined" "slimming"
## [34] "stimulant" "relaxant" "no.effect.health"
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ afternoon.tea : Factor w/ 2 levels "afternoon.tea",..: 2 2 1 2 2 2 1 1 1 2 ...
## $ evening : Factor w/ 2 levels "evening","not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ after.lunch : Factor w/ 2 levels "after.lunch",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ after.dinner : Factor w/ 2 levels "after.dinner",..: 2 2 1 1 2 1 2 2 2 2 ...
## $ anytime : Factor w/ 2 levels "anytime","not.anytime": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ restaurant : Factor w/ 2 levels "not.restaurant",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ variety : Factor w/ 3 levels "black","flavoured",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ how : Factor w/ 4 levels "lemon","milk",..: 3 2 3 3 3 3 3 2 2 3 ...
## $ sugar : Factor w/ 2 levels "not.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ format : Factor w/ 3 levels "loose","sachet",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ place.of.purchase: Factor w/ 3 levels "specialist.shop",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ type : Factor w/ 6 levels "cheapest","known.brand",..: 5 6 6 6 6 4 6 6 3 3 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ profession : Factor w/ 7 levels "employee","management",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ sport : Factor w/ 2 levels "not.sporty","sporty": 2 2 2 1 2 2 2 2 2 1 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1 to 2/week",..: 2 2 4 2 4 2 3 1 4 4 ...
## $ exotic : Factor w/ 2 levels "exotic","not.exotic": 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ good.for.health : Factor w/ 2 levels "good for health",..: 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ refined : Factor w/ 2 levels "not.refined",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "not.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ stimulant : Factor w/ 2 levels "not.stimulant",..: 1 2 1 1 1 1 1 1 1 1 ...
## $ relaxant : Factor w/ 2 levels "not.relaxant",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ no.effect.health : Factor w/ 2 levels "no.effect.health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast afternoon.tea evening
## breakfast :144 afternoon.tea :169 evening :103
## not.breakfast:156 Not.afternoon.t:131 not.evening:197
##
##
##
##
##
## after.lunch after.dinner anytime
## after.lunch : 44 after.dinner : 21 anytime :103
## not.after.lunch:256 not.after.dinner:279 not.anytime:197
##
##
##
##
##
## home work tearoom friends
## home :291 not.work:213 not.tearoom:242 friends :196
## not.home: 9 work : 87 tearoom : 58 not.friends:104
##
##
##
##
##
## restaurant pub variety how
## not.restaurant:221 not.pub:237 black : 74 lemon : 33
## restaurant : 79 pub : 63 flavoured:193 milk : 63
## green : 33 nothing.added:195
## other : 9
##
##
##
## sugar format place.of.purchase
## not.sugar:155 loose : 36 specialist.shop : 30
## sugar :145 sachet :170 supermarket :192
## sachet+loose: 94 supermarket+specialist.: 78
##
##
##
##
## type sex profession sport
## cheapest : 7 F:178 employee :59 not.sporty:121
## known.brand: 95 M:122 management :40 sporty :179
## luxury : 53 manual labourer :12
## shop.brand : 21 other work :20
## unknown : 12 senior management:35
## varies :112 student :70
## unemployed :64
## age age_Q frequency exotic
## Min. :15.00 15-24 :92 1 to 2/week : 44 exotic :142
## 1st Qu.:23.00 25-34 :69 1/day : 95 not.exotic:158
## Median :32.00 35-44 :40 3 to 6/week : 34
## Mean :37.05 45-59 :61 more than 2/day:127
## 3rd Qu.:48.00 60 and +:38
## Max. :90.00
##
## spirituality good.for.health diuretic
## not.spirituality:206 good for health :210 diuretic :174
## spirituality : 94 not.good for health: 90 not.diuretic:126
##
##
##
##
##
## friendliness iron.absorption feminine
## friendliness :242 iron absorption : 31 feminine :129
## not.friendliness: 58 not.iron absorption:269 not.feminine:171
##
##
##
##
##
## refined slimming stimulant
## not.refined: 85 not.slimming:255 not.stimulant:184
## refined :215 slimming : 45 stimulant :116
##
##
##
##
##
## relaxant no.effect.health
## not.relaxant:113 no.effect.health : 66
## relaxant :187 not.without.effect.health:234
##
##
##
##
##
In Tea data most of the variables are in factor format. Data is containing 36 variables and 300 observations. Let’s have a visual overlook in data.
# Splitting the plot
colnames(tea)
## [1] "breakfast" "afternoon.tea" "evening"
## [4] "after.lunch" "after.dinner" "anytime"
## [7] "home" "work" "tearoom"
## [10] "friends" "restaurant" "pub"
## [13] "variety" "how" "sugar"
## [16] "format" "place.of.purchase" "type"
## [19] "sex" "profession" "sport"
## [22] "age" "age_Q" "frequency"
## [25] "exotic" "spirituality" "good.for.health"
## [28] "diuretic" "friendliness" "iron.absorption"
## [31] "feminine" "refined" "slimming"
## [34] "stimulant" "relaxant" "no.effect.health"
tea_subset_1 <- dplyr::select(tea, breakfast:pub)
tea_subset_2 <- dplyr::select(tea, variety:frequency) %>% select(-age)
tea_subset_3 <- dplyr::select(tea, exotic:no.effect.health)
# visualize the dataset
gather(tea_subset_1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
gather(tea_subset_2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
gather(tea_subset_3) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
The last 12 variables are all categorical variables in “yes - now” dimension. They all are represnting reasons to consume tea. In this part of analysis we are interested in these variables.
# Lets select subset of data. Why people are drinking tea?
tea_time <- dplyr::select(tea, exotic, spirituality, good.for.health, diuretic, friendliness, iron.absorption, feminine, refined, slimming, stimulant, relaxant, no.effect.health)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
For categorical varibles there is technique which is closely corresponding the PCA, Multiple Correspondence Analysis (MCA). More about MCA can be read from here.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.141 0.121 0.104 0.098 0.096 0.081
## % of var. 14.140 12.066 10.407 9.788 9.606 8.136
## Cumulative % of var. 14.140 26.207 36.613 46.401 56.007 64.143
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.073 0.066 0.063 0.059 0.052 0.045
## % of var. 7.271 6.626 6.280 5.905 5.237 4.537
## Cumulative % of var. 71.414 78.040 84.321 90.226 95.463 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.691 1.124 0.425 | -0.345 0.329 0.106 | -0.384
## 2 | -0.307 0.222 0.081 | 0.155 0.066 0.020 | -0.565
## 3 | -0.215 0.109 0.071 | -0.518 0.742 0.411 | -0.255
## 4 | -0.123 0.036 0.015 | -0.284 0.223 0.079 | 0.539
## 5 | -0.190 0.085 0.037 | -0.038 0.004 0.001 | 0.321
## 6 | -0.608 0.872 0.357 | -0.578 0.922 0.322 | -0.036
## 7 | -0.608 0.872 0.357 | -0.578 0.922 0.322 | -0.036
## 8 | 0.048 0.005 0.004 | -0.448 0.554 0.347 | 0.161
## 9 | -0.612 0.883 0.368 | -0.205 0.116 0.041 | 0.308
## 10 | -0.215 0.109 0.071 | -0.518 0.742 0.411 | -0.255
## ctr cos2
## 1 0.473 0.131 |
## 2 1.023 0.274 |
## 3 0.209 0.100 |
## 4 0.931 0.285 |
## 5 0.329 0.105 |
## 6 0.004 0.001 |
## 7 0.004 0.001 |
## 8 0.083 0.045 |
## 9 0.303 0.093 |
## 10 0.209 0.100 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## exotic | 0.214 1.283 0.041 3.515 | 0.394 5.087 0.140
## not.exotic | -0.193 1.153 0.041 -3.515 | -0.355 4.572 0.140
## not.spirituality | -0.236 2.253 0.122 -6.039 | -0.028 0.037 0.002
## spirituality | 0.517 4.937 0.122 6.039 | 0.061 0.081 0.002
## good for health | 0.314 4.064 0.230 8.290 | -0.350 5.915 0.285
## not.good for health | -0.732 9.483 0.230 -8.290 | 0.816 13.801 0.285
## diuretic | 0.406 5.637 0.228 8.252 | 0.122 0.594 0.020
## not.diuretic | -0.561 7.785 0.228 -8.252 | -0.168 0.820 0.020
## friendliness | 0.156 1.161 0.102 5.519 | -0.008 0.004 0.000
## not.friendliness | -0.652 4.842 0.102 -5.519 | 0.034 0.015 0.000
## v.test Dim.3 ctr cos2 v.test
## exotic 6.466 | 0.426 6.871 0.163 6.979 |
## not.exotic -6.466 | -0.383 6.175 0.163 -6.979 |
## not.spirituality -0.712 | -0.236 3.072 0.122 -6.050 |
## spirituality 0.712 | 0.518 6.731 0.122 6.050 |
## good for health -9.239 | -0.200 2.243 0.093 -5.284 |
## not.good for health 9.239 | 0.467 5.234 0.093 5.284 |
## diuretic 2.475 | -0.326 4.944 0.147 -6.630 |
## not.diuretic -2.475 | 0.451 6.828 0.147 6.630 |
## friendliness -0.287 | -0.014 0.012 0.001 -0.485 |
## not.friendliness 0.287 | 0.057 0.051 0.001 0.485 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## exotic | 0.041 0.140 0.163 |
## spirituality | 0.122 0.002 0.122 |
## good.for.health | 0.230 0.285 0.093 |
## diuretic | 0.228 0.020 0.147 |
## friendliness | 0.102 0.000 0.001 |
## iron.absorption | 0.068 0.008 0.011 |
## feminine | 0.310 0.009 0.007 |
## refined | 0.215 0.030 0.090 |
## slimming | 0.275 0.055 0.003 |
## stimulant | 0.030 0.258 0.127 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
As in PCA, dimensions are ordered by the variance. Dim 1 is capturing 14.1 % of the overall variance. Cumulatively first two variables are responding 26.2 % of the variance, meaning that major part of the data’s overall variance is captured by other dimensions.
From bixplot above we can see that most of the categorical varibles values are quite near by the origo in our 2D space. This means that two dimensional MCA model does not explain major part of the variaty in our tea data. Still, there are some values in our data making clusters, for example group of values no effect on health, not good for health and not relaxant can be ditinguished (interpretation could be “the cluster of pessimists”). Dim 1 could be seen as “effect of the tea” dimension where values in left are not indentifying any effect of the and in the right values are more devoted on positive effect of tea.